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SAMPLING FROM LARGE MATRICES: AN APPROACH 
THROUGH GEOMETRIC FUNCTIONAL ANALYSIS 

MARK RUDELSON AND ROMAN VERSHYNIN 

Abstract. We study random submatrices of a large matrix A. We show how to 
approximately compute A from its random submatrix of the smallest possible size 
O(rlogr) with a small error in the spectral norm, where r = ||A|||,/||A||| is the 
numerical rank of A. The numerical rank is always bounded by, and is a stable 
relaxation of, the rank of A. This yields an asymptotically optimal guarantee in an 
algorithm for computing low-rank approximations of A. We also prove asymptotically 
optimal estimates on the spectral norm and the cut-norm of random submatrices 
of A. The result for the cut-norm yields a slight improvement on the best known 
sample complexity for an approximation algorithm for MAX-2CSP problems. We 
use methods of Probability in Banach spaces, in particular the law of large numbers 
for operator-valued random variables. 


1. Introduction 

This paper studies random submatrices of a large matrix A. The study of random 
submatrices spans several decades and is related to diverse areas of mathematics and 
computer science. Two main reasons for the interest in random submatrices are: 

(1) one can learn properties of A from the properties of its random sub matrices; 

(2) properties of A may improve by passing to its random submatrices. 

We address both aspects of random submatrices in this paper. We show how to ap¬ 
proximate A by its random submatrix in the spectral norm, and we compute the 
asymptotics of the spectral and the cut norms of random submatrices. This yields im¬ 
provements upon known algorithms for computing low rank approximations, Singular 
Value Decompositions, and approximations to MAX-2CSP problems. 

1.1. The spectral norm: low rank approximations and SVD. Can one approx¬ 
imate A by only knowing a random submatrix of A of a fixed size? If so, what is the 
sample complexity, the minimal size of a submatrix which yields a good approximation 
with a small error in some natural norm, and with high probability? 
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This problem belongs to a class of problems the Statistical Learning Theory is con¬ 
cerned with. These problems inevitably bear the assumption that the the object to be 
learned belongs to a relatively small “target” class. To be able to learn A from a matrix 
of small size thus of small rank, we have to assume that A itself has small rank -or can 
be approximated by an (unknown) matrix of a small rank. We thus strive to find a 
low rank approximation to a matrix A, whenever such an approximation exists, from 
only knowing a small random sub matrix of A. 

Solving this problem is essential for development of fast Monte-Carlo algorithms for 
computations on large matrices. An extremely large matrix A - say, of the order of 
10 5 x 10 5 - is impossible to upload into the Random Access Memory (RAM) of a 
computer; it is instead stored in an external memory. On the other hand, sampling a 
small submatrix of A, storing it in RAM and computing its small rank approximation 
is feasible. 

The crucial assumption that A is essentially a low rank matrix holds in many ap¬ 
plications. For example, this is a model hypothesis in the Latent Semantic Indexing 
(see mm si s m m- There A is the “document-term matrix”, which is formed of 
the frequencies of occurrence of various terms in the documents of a large collection. 
The hypothesis that the documents are related to a small number of (unknown) topics 
translates into the assumption that A can be approximated by an (unknown) low rank 
matrix. Finding such an approximation would determine the “best” topics the collec¬ 
tion is really about. Other examples where this problem arises include clustering of 
graphs [9j, DNA microarray data, facial recognition, web search (see HU), lossy data 
compression and cryptography (see 0). 

The best fixed rank approximation to A is obviously given by the partial sums of 
the Singular Value Decomposition (SVD) 

A = crj(A) Uj 0 Vj 
o 

where &j(A) is the nonincreasing and nonnegative sequence of the singular values of 
A, and Uj and Vj are left and right singular vectors of A respectively. The best rank k 
approximation to A in both the spectral and Frobenius norms is thus AP where P f. 
is the orthogonal projection onto the top k left singular vectors of A. In particular, for 
the spectral norm we have 

R mi LR<u ^ A ~ B II 2 = W A ~~ APk II 2 = ak +^ A )- ( L1 ) 

However, computing P *., which gives the first elements of the SVD of a m x n matrix 
A, is often impossible in practice because (1) it would take many passes through A, 
which is prohibitively slow for a matrix stored in an external memory; (2) this would 
take superlinear time in m + n. Instead, it was proposed in [15j [10|, lly [12] to use the 
Monte-Carlo methodology: namely, approximate the fc-th partial sum of the SVD of 
A by the fc-th partial sum of the SVD of a random submatrix of A. In this paper, we 
show that this can be done: 
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(1) with almost linear sample complexity 0(r log r), that is by sampling only 0(r log r) 
random rows of A, if A is approximable by a rank r matrix; 

(2) in one pass through A if the matrix is stored row-by-row, and in two passes if 
its entries are stored in arbitrary order; 

(3) using RAM space and time 0(n + m) (and polynomial in r and k). 

Theorem 1.1. Let A be an m x n matrix with numerical rank r = ||A||^/ ||A|| 9 . Let 
e, 5 G (0,1) , and let d < m be an integer such that 

Consider a d x n matrix A, which consists of d normalized rows of A picked indepen¬ 
dently with replacement, with probabilities proportional to the squares of their Euclidean 
lengths. Then with probability at least 1 — 2 exp {—c/8) the following holds. For a pos¬ 
itive integer k, let Pk be the orthogonal projection onto the top k left singular vectors 
of A. Then 

\\A — AP k \\ 2 < <7fc + i(v4) + e \\A\\ 2 . (1-3) 

Here and in the sequel, C, c, .. . denote positive absolute constants. 

Comparing (II. 3ft with the best approximation (11.11) given by the SVD, we see an 
additional error £ ||A|| 9 which can be made small by increasing the size d of the sample. 

Remark 1.2 (Optimality). The almost linear sample complexity d = 0(r logr) achieved 
in Theorem 11.11 is optimal, see Proposition 13.91 below. The best known previous re¬ 
sult, due to Drineas, Kannan and Mahoney, had with the quadratic sample complexity 
d = 0(r 2 ) ([llj Theorem 4, see also [12] Theorem 3). The approximation scheme in 
Theorem 11.11 was developed in [15], TOl fill fl2] . 

Remark 1.3 (Numerical rank). The numerical rank r = r(A) = || A\\ 2 F / || A \\ 2 in 
Theorem 11.11 is a relaxation of the exact notion of rank. Indeed, one always has 
r(A) < rank(A). But as opposed to the exact rank, the numerical rank is stable 
under small perturbations of the matrix A. In particular, the numerical rank of A 
tends to be low when A is close to a low rank matrix, or when A is sufficiently sparse. 

So results like Theorem o which depend on the numerical rather than exact rank, 
should be useful in many applications, such as the Principal Component Analysis. 

Remark 1.4 (Law of large numbers for operator-valued random variables). The new 
feature in our proof of Theorem 11.11 is a use of the first author’s argument about 
random vectors in the isotropic position [2D]. It yields a law of large numbers for 
operator-valued random variables. We apply it for independent copies of a rank one 
random operator, which is given by a random row of the matrix A T A. 
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1.2. The cut-norm: decay, MAX-CSP problems. Alon, Fernandez de la Vega, 
Kannan and Karpinski 00 reduced the problem of additive approximation of the 
MAX-CSP problems (which are NP-hard) to computing the cut-norm of random sub¬ 
matrices. The cut norm of an n x n matrix A is the maximum sum of the entries of 
its submatrix, 

Pile = max A v » 

i£l,jeJ 

and it is equivalent to the C*, —> £± operator norm. 

The problem is to understand how the cut norm of A decreases when we pass to its 
random submatrix. Let Q be a a random subset of {1,... ,n} of expected cardinality 
q. This means that each element of {1,... ,n} is included into Q independently with 
probability q/n. We form a Q x Q random submatrix A\q x q = ( Aij) iy j & Q. 

Intuitively, A|q x q is {q/n) 2 times smaller than A if A is diagonal-free, but only {q/n) 
times smaller than A if A is a diagonal matrix. We prove a general estimate of the 
cut-norm of random submatrices, which combines both of these types of decay: 

Theorem 1.5. Let A be an n x n matrix. Let Q be a random subset of {1,..., n} of 
expected cardinality q. Then 

EPIqxcIIc < o((i)V - D(A)\\c + (£)||£KAIIc + (“) ;i/ 2 (ll' 4 llc°' + IlYllci)), 

where 11A11ooi is the sum of the Euclidean lengths of the columns of A, and D{A) is the 
diagonal part of A. 

Remark 1.6 (Optimality). The estimate in this theorem is optimal, see Section l4~2l 

We now state a partial case of Theorem 11.51 in the form useful for MAX-CSP prob¬ 
lems. Note that ||A|| Col < A/n||A|| F . Then we have: 

Corollary 1.7. Under the hypotheses of Theorem \1.5[ let q = fl(£~ 2 ). Assume that 
Pile = 0{en 2 ), and ||Ap = 0{n), ||A||oo = 0{e _1 ), where ||A||oo denotes the maxi¬ 
mum of the absolute values of the entries of A. Then 

e PIqxq||c = 0{eq 2 ). 

In solving MAX-2-CSP problems, one approximates the edge-weight matrix W of 
the graph on n vertices by a cut approximation W\ and checks that the the error 
matrix A = W — W' satisfies the assumptions of Corollary 11.71 see [13, H 3j. Hence 
the Corollary says that for a random induced graph on q = fl(e“ 2 ) vertices, the same 
cut-approximation (induced on the q vertices of the random subgraph) works. Namely, 
the error in cut-norm is at most eq 2 . 

A weaker form of Corollary 11.71 was proved by Alon, Fernandez de la Vega, Kan¬ 
nan and Karpinski [3], Theorem 6. Their result has a bigger sample complexity 
q = H(e -4 log(l/£)) and an extra assumption n = e^ £ \ but it works for multidi¬ 
mensional arrays rather than for matrices (=2-dimensional arrays). 
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Using Corollary 1 1.71 instead of [3] Theorem 6 slightly improves the best known sample 
complexity for the approximation algorithm for MAX-2-CSP problems due to [3] . The 
solution to a MAX-2SCP problem on n variables can be approximated within the 
additive error en 2 by the solution of the problem induced by a randomly chosen q 
variables. The best known sample complexity, due to [3], is q = 0{e~ A log(l/e)). Using 
Corollary 11.71 in the argument of [3] Theorem 6 improves the sample complexity to 

q = 0(£~ a ). 

Our proof of Theorem 11.51 uses the technique of probability in Banach spaces, and 
includes decoupling, symmetrization, and application of a version of Slepian’s lemma 
for Bernoulli random variables due to Talagra.nd. 

1.3. The spectral norm: decay. Perhaps the most important matrix norm is the 
spectral norm. Nevertheless, its decay under passing to submatrices has not been 
sufficiently understood. 

Let A be an n x n matrix, and Q be a random subset of {1,... ,n} of expected 
cardinality q (as above). We consider a random row-submatrix A\q = (Ay) ie gj< T l , 
which consists of the rows of A in Q. 

When one orthogonally projects a vector x G R n onto RP, its Euclidean length 
reduces in average by the factor of So, one should expect a similar type of decay 
for the spectral norm - something like E||A|q|| 2 < y^"||A|| 2 . 

However, similarly to the previous section, the diagonal matrices exhibit a different 
type of decay. For example, there is no decay at all for the identity matrix. One can 
check that the correct order of decay for diagonal matrices is 

||A||(fc) = the average of k biggest Euclidean lengths of the columns of A, 

where k = n/q. General matrices again combine both types of decay: 

Theorem 1.8. Let A be an n x n matrix. Let Q be a random subset of {1,..., n} of 
expected cardinality q. Then 

e||A|q || 2 < ^(y^ll^ll 2 + ll^ll(n/ 9 ))- 

Remark 1.9 (Optimality). The estimate in this theorem is optimal. The example 
considered in the proof of Proposition 13.91 below shows that the coefficient y'log q is 
necessary. 

Generalizing an earlier result of Lunin [18], Kashin and Tzafriri [TB] (see [22]) essen¬ 
tially proved the existence of a subset Q of cardinality q and such that 

PI*<o(yfpi| 2 + Mk). 

Note that ■kjk = (y )Cr=i lAP) 1 ^ is the average of the lengths of all columns of A. 
As the example of diagonal operators shows, for random subsets Q this term has to be 
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replaced by the average of the few biggest columns. Talagrand EU proved deep results 
on the more general operator norms £2 -X, where X is a 2-smooth Banach space. 

However, the decay on j in his results is logarithmic rather than polynomial. 

1.4. Stable matrix results. Many problems on random submatrices, of both theo¬ 
retical and practical importance, have functional-analytic rather than linear-algebraic 
nature. These problems, like those this paper considers, are about estimating operator 
norms. We thus see a matrix A as a linear operator A between finite dimensional 
normed spaces - say, between and ££ f° r the spectral norm, and between and l\ 
for the cut norm. 

From this perspective, the dimension n of the ambient normed space should play 
a minor role, while the real control of the picture should be held by (hopefully few) 
quantities tied to the operator rather than the space. As a trivial example, if A is 
not of full rank then the dimension n is useless compared to the rank of A. Further, 
we are looking for stable results, those not ruined by small perturbations of the linear 
operators. This is a natural demand in applications, and this differs our analytic 
perspective from the linear algebraic one. It would thus be natural to look for stable 
quantities tied to linear operators, which govern the picture. For example, operator 
norms are stable quantities, while the rank is not. 

This paper advances such approach to matrices. The low rank approximations in 
Theorem 11.11 are only controlled by the numerical rank r(A) = ||A|||i/||A||| of the 
matrix, which is a stable relaxation of the rank. The norms of random matrices in 
Theorems 11.51 and 11.81 are essentially controlled by the norms of the original matrix 
(and naturally by the sampling factor, the ratio of the size of the submatrix to the size 
of the original matrix). The dimension n of the matrix does not play a separate role 
in these results (although the matrix norms may grow with the dimension). 

Acknowledgement. This project started when the authors participated in the 
PIMS Thematic Programme on Asymptotic Geometric Analysis at the University of 
British Columbia in Summer 2002. The first author was a PIMS postdoctoral fellow 
at that time. We are grateful to PIMS for its hospitality. The final part of this 
research was done when the first authour visited University of California, Davis. We 
are grateful for R. Kannan for his comments on the initial version of this paper, and 
to M. Karpinski for explaining what was the correct consequence of Corollary 11.71 for 
MAX-2-CSP problems. Finally, we thank the referees for their valuable comments and 
suggestions. 


2. Notation 

For p < 00 , the finite dimensional £ p spaces are denoted by £ p . Thus is the Banach 
space (M n , || • || p ), where ||x|| p = (^” =1 | x j| p ) 1/,p for p < 00 , and ||a;||oo = max* \xi\. The 
closed unit ball of £ p is denoted by B p := {a; | ||ar|| p < 1}. 
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The canonical basis of R” is denoted by (ei,. .., e n ). Let x, y G R". The canonical 
inner product is denoted by (x,y) := x T y. The tensor product is defined as x 0 y : = 
yx T \ thus (x ® y)z = (x, z) y for all z 6 R n . 

Let A = (Aij)ij be an m x n real matrix. The spectral norm of A is the operator 
norm i 2 —> 0, defined as 

P|| 2 := sup = ai{A), 

a;eR n |p||2 

where (T\(A) is the largest singular value of A. The Frobenius norm ||A||^ of A is 
defined as 

m \ 2 F--=J2 A l = J2 a i( A ) 2 . 

i,j 3 

where crj(A) are the singular values of A. 

Finally, C,C\,c,ci,... denote positive absolute constants. The a = 0(b) notation 
means that a < Cb for some absolute constant C. 


3. Low RANK APPROXIMATIONS 

In this section, we prove Theorem 11.11 discuss the algorithm for finding low rank 
approximations, and show that the sample complexity in Theorem 11.11 is optimal. 
Our argument will be based on the law of large numbers for operator-valued random 
variables. 


3.1. Law of large numbers for operator-valued random variables. Theorem ll.il 

is about random independent sampling the rows of the matrix A. Such sampling can 
be viewed as an empirical process taking values in the set of rows. If we sample enough 
rows, then the matrix constructed from them would nicely approximate the original 
matrix A in the spectral norm. For the scalar random variables, this effect is the 
classical Law of Large Numbers. For example, let X be a bounded random variable 
and let X\ .. . Xd be independent copies of A". Then 


E 




EA 


3 = 1 



(3.1) 


Furthermore, the large deviation theory allows one to estimate the probability that the 
empirical mean ^ Xp=i Xj stays close to the true mean EA. 

Operator-valued versions of this inequality are harder to prove. The absolute value 
must be replaced by the operator norm. So, instead of proving a large deviation es¬ 
timate for a single random variable, we have to estimate the supremum of a random 
process. This requires deeper probabilistic techniques. The following Theorem gener¬ 
alizes the main result of [20] , 
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Theorem 3.1. Let y be a random vector in M n ; which is uniformly bounded almost 
everywhere: ||j/|| 2 < M. Assume for normalization that ||Ey ® y || 2 < 1. Let yi... y d be 
independent copies of y. Let 


Then 

(i) If a < 1 then 


(ii) For every t G (0,1), 


E 


d 




^2yi®yi-^y ®y 

i= 1 


< a. 


- { ^ yi ® yi ~ ^y ® y o > < 2 exp (~ct 2 /a 2 ). 


i =1 


Remark 3.2. Part (i) is a law of large numbers, and part (ii) is a large deviation 
estimate for operator-valued random variables. Comparing this result with its scalar 
prototype (13. ip . we see an additional logarithmic factor. This factor is essential, as we 
show in Remark 13.41 below. 


Remark 3.3. The boundedness assumption 11 y 11 2 < M can be too strong for some 
applications. The proof of Theorem 13.11 shows that, in part (i), the boundedness almost 
everywhere can be relaxed to the moment assumption E||y||2 < M q , where q = log d. 
Part (ii) also holds under an assumption that the moments of ||y|| 2 have a nice decay. 
However, we do not need these improvements here. 


Remark 3.4. The estimate in Theorem 13.II is in general optimal. Indeed, consider the 
random vector y taking values \fne\, ..., \fne n each with probability 1/n, where (e*) 
is the canonical basis of M”. Then 'Ey <g>y — I. Then 


E 


-,J2 y i® y i 


j=i 


= E max 

2 i=l...n 


n 


-J \{j I Vj = \fnei}\ - 1 


If we want this quantity to be 0(1), then it is not hard to check that d should be of 
order at least nlogn. Therefore, the coefficient A/log (d)/d in Theorem 13.11 is optimal. 


3.2. Proof of Theorem 13.11 The proof consists of two steps. First we use the stan¬ 
dard symmetrization technique for random variables in Banach spaces, see e.g. na 
Section 6. Then we adapt the technique of [20] to obtain a bound on a symmetric 
random process. To obtain the probability estimate in part (ii), we shall estimate the 
high moments rather than the first moment in part (i). 

Let £\.. .£d denote independent Bernoulli variables taking values 1,-1 with proba¬ 
bility 1/2. Let yi ■ ■ - yd,yi ■ ■ - Vd be independent copies of y. We shall denote by E. y , E^ 
and E e the expectations according to (yi), (yi) and (e*) respectively. 


























SAMPLING FROM LARGE MATRICES 


9 


Let p > 1. We shall estimate 


E p := (E 


^ Vi 0 Vi _ E?/ 0 y 


i= 1 


p\ i/p 


(3.2) 


Note that E y y 0 y = Eg y ® y = Eg ^ Sf=i Vi ® Vij ■ We put this into (13.211 . Since 
x i—> ||x ||2 is a convex function on M ra , Jensen’s inequality implies that 

p\ i Ip 

UL w \ 7L- (5?) ?/. 

' d 

1=1 1=1 

Since Di®yi — yi® y% is a symmetric random variable, it is distributed identically with 
£i{Vi ® Vi~ Vi® Vi)- Thus 

d 


E p < E y Eg 


d i d 

-r Y y< 0 yi -1Y & 0 & 


E p < E y EgE £ 


d 


^2si(yi®yi-yi®yi) 


i=l 


p\ i Ip 


Denote Y = \ Yi=i £ dJj ® y% and y = \ Y=i £ iVi 0 Vi- Then ~ ^llf < (ll^lb + 
||y|| 2 r < 2 p (||y||f + ||y||D, and E||yH 2 = E\\Y\\l Thus we obtain 

d 


E p <2i E :y E e 


Y 0 y* 


i=l 


p\ 1 Ip 


We shall estimate the last expectation using a lemma from 

Lemma 3.5. Let y\ ■ ■ .yd be vectors in R k and let £\.. .£d be independent Bernoulli 
variables taking values 1, —1 with probability 1/2. Then 

d ^ , d 


E Y £iVi ® Vi 


p\ 1 /p 


1/2 

2 


< C 0 (p + log fc) 1/2 ■ max \\yt\\ 2 ■ Vi ® Vi 

1=1 ...a II L ' 

i=l i =1 

Remark 3.6. We can consider the vectors y\ . .. yd as vectors in their linear span, so 
we can always choose the dimension k of the ambient space at most d. 


Combining Lemma 13.51 with Remark 13.61 and using Holder’s inequality, we obtain 


E p < 2Cq- 


(p + logd) 1//2 „ _ /GJI P\!/2P 

~d 


' M ' 2 ) 


(3.3) 


i= 1 


By Minkowski’s inequality we have 


E Y yi ® yi 

i=l 

So we obtain 


p\ i/p 


< d 


E 


-iY Vi ® Vi ~^v®v 


i =1 


p\ i/p 


E p < 


ap 


1/2 


(. E p + 1), where a = 4C’ 0 


+ ||E?/ ® 2/1 |2 Yd(E p + 1). 

logd\ V 2 . 


M. 
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It follows that 

min(£'p, 1) < ap 1 ^ 2 . (3.4) 

To prove part (i) of the theorem, note that a < 1 by the assumption. It thus follows 
that E\ < a. This proves part (i). 

To prove part (ii), we can E p = (E Z p ) 1/p , where 




1 

d 


d 

^ Vi ® y* - % ® 2/ 

i =1 


2 


So (13.4(1 implies that 

(Emin(Z, l) p ) 1//p < min(E),, 1) < ap 1 ^ 2 . (3.5) 

This moment bound can be expressed as a tail probability estimate using the following 
standard lemma, see e.g. ra Lemmas 3.7 and 4.10. 

Lemma 3.7. Let Z be a nonnegative random variable. Assume that there exists a 
constant K > 0 such that (E Z p ) 1 l p < Kp 1 ^ 2 for all p > 1. Then 

F{Z > t} < 2exp(— c\t 2 /K 2 ) for all t > 0. 

It thus follows this and from (13.51) that 

P{min(Z, 1) > t} < 2 exp(— c\t 2 /a 2 ) for all t > 0. 

This completes the proof of the theorem. 


3.3. Proof of Theorem 11.11 By the homogeneity, we can assume || aL|| 2 = 1. 

The following lemma of Drineas and Kannan no (see also HP) reduces Theorem 
11.11 to a comparison of A and a sample A in the spectral norm. 

Lemma 3.8 (Drineas, Kannan). 

\\A - AP k \\\ < a k+1 (A ) 2 + 2||A T A - i T i|| 2 . 

Proof. We have 

||A — AP fc ||2 = sup 11 11 2 — sup (A T Ax,x) 

(cEkerPfc, ||x|| 2 =l rrEkerP/-, ||rr ||2 = l 

< sup ((A t A — A t A)x,x)+ sup ( A T Ax,x) 

(cGkerPfc, ||tc||2 = l rrEkerP^, ||a;||2=1 

= \\A T A — A t ^4|| 2 + <j k +i{A T A). 

By a result of perturbation theory, \<j k+ i(A T A) — a k+ i{A T A) \ < ||/l T Al — A r y4|| 2 . 
This proves Lemma [3.81 ■ 
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Let X\... x m denote the rows of the matrix A. Then 

m 

A T A = Xj 8) Xj. 

3 = 1 


We shall view the matrix A T A as the true mean of a bounded operator valued random 
variable, whereas A T A will be its empirical mean ; then we shall apply the Law of Large 
Numbers for operator-valued random variables - Theorem 13.11 To this end, define a 
random vector y G R m as 


P 





Let yi... yd be independent copies of y. Let the matrix A consist of rows ^=x/i ■ ■ • ^ Vd■ 
(The normalization of A here is different than in the statement of Theorem 11.11 in the 
proof, it is convenient to multiply A by the factor -^HAH^. However note that the 
singular vectors of A and thus P k do not change.) Then 

1 .4 ' __ 

A T A = Ey®y, A T A=-^y j ®y j , M := \\y\\ 2 = \\A\\ F = y/f. 

i= 1 

We can thus apply Theorem 13.11 Due to our assumption on d, we have 


a := 4 Cr 


^logd ^!/ 2 ^ £ 2 5 1 / 2 


d 




< 1 . 


Thus Theorem 13. II yields (with t = e 2 /2) that, with probability at least 1 — 2 exp (—c/5), 
we have 

\\A t A-A t A\\ 2 <^. 

Whenever this event holds, we can conclude by Lemma 13.81 that 

\\A — AP k \\ 2 < a k+l (A) + V2\\AAA — H t 4l|| 2 ^ < a k+ i(A) + s. 


This proves Theorem 11.11 


3.4. Algorithmic aspects of Theorem ll.il Finding a good low rank approximation 
to a matrix A amounts, due to Theorem 11.11 to sampling a random submatrix A and 
computing its SVD (actually, only left singular vectors are needed). The algorithm 
works well if the numerical rank r = r(A) = ||x4||^/||x4||| of the matrix A is small. 
This is the case, in particular, when A is essentially a low-rank matrix, because r(A) < 
rank (A). 

First, the algorithm samples d = 0(r log r) random rows of A. Namely, it takes d 
independent samples of the random vector y whose law is 

L = _AA = Mil 

V IIAIlJ pf F 


P 
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where Aj is the j-th row of A. This sampling can be done in one pass through A if 
the matrix is stored row-by-row, and in two passes if its entries are stored in arbitrary 
order Section 5.1]. 

Then the algorithm computes the SVD of the d x n matrix A, which consists of 
the normalized sampled rows. This can be done in time 0(dn)+ the time needed to 
compute the SVD of a d x d matrix. The latter can be done by one of the known 
methods. This takes significantly less time than computing SVD of the original m x n 
matrix A. In particular, the running time of this algorithm is linear in the dimensions 
of the matrix (and polynomial in d). 

3.5. Optimality of the sample complexity. The sample complexity d = O(rlogr) 
in Theorem 11.11 is best possible: 

Proposition 3.9. There exist matrices A with arbitrarily big numerical rank r = 
II an d suc h that whenever 

d < —rlogr, 

the conclusion (11. 3p of Theorem M.il fails for k — n and for all £ G (0,1). 


Proof. Let n,m G N be arbitrary numbers such that n < m. We define the m x n 
matrix by its entries as follows: 



where d %3 = 1 if i — j and S tJ = 0 otherwise. 

Then each row of A contains exactly one entry of value y^A, and each row is repeated 
m/n times. The j'-th column of A contains exactly one block of values ypA in positions 
i G ( — (j — 1), — j] =: Ij■ In particular, the columns are orthonormal. Also, ||A|| 2 = 1, 
||A||i? = y/n, thus r = n. 

Now we form a submatrix A as described in Theorem 1.1 - by picking d rows of 
A independently and with uniform distribution. If d < An log n, then with high 
probability there exists at least one block Ij from which no rows i are picked. Call this 
block Ij 0 . It follows that j 0 _ th column of A is zero. Consider the coordinate vector 
e jo = (0,..., 0,1, 0,..., 0) of n positions, with 1 at position j 0 . Then 6j 0 G ker A C 
ker P k C ker(AP fc ). Thus ||(A — AP k )ej 0 1| 2 = ||Aej 0 || 2 = 1. Hence 

||A - AP k || 2 > 1, while cr n+ i(A) = 0, ||A|| 2 = 1. 


Hence (j 1.3 [) fails for k — n and for all £ G (0,1). 
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4. The decay of the cut norm 

In this section, we prove Theorem 11.51 on the cut norm of random submatrices and 
show that it is optimal. Our argument will be based on the tools of probability in 
Banach spaces: decoupling, symmetrization, and Slepian’s Lemma (more precisely, its 
version for the Rademacher random variables due to M.Talagrand). 


4.1. Proof of Theorem 11.51 It is known and easy to check that 


1 

4 


oo —>1 < Pile < P||oo->l) 


where pP^i denotes the operator norm of A from into ££>: 


PIU^i := sup 

xSK n 


jkMji 

||^||oo 


sup px||i 


(recall that B ^ denotes the unit ball of fp. Note also that both these norms are 
self-dual: 


= Pile, Ploo-H = Plloo^l. 


So we can prove Theorem 11.51 for the norm || • instead of the cut norm. 

We shall use the following decoupling lemma due to Bourgain and Tzafriri j7j. 


Lemma 4.1. Let (£*) be a finite sequence of bounded i.i.d. random variables, and (£') 
be its independent copy. Then for any sequence of vectors (xp in a Banach space with 

Xu = 0 , 


E|| fifijXjj 
hj 


< 20E| 


Let hi... S n be independent Bernoulli random variables, which take value 1 with 
probability 6 := q/n. Let Pa denote the coordinate projection on the random set of 
coordinates {j \ 5j = 1}. 

Denote by D(A) the diagonal part of A. Then 


Pa^Pa — PaP — D(A))P/± + P\D (A)Pa — ^ SidjAije^ 0 ej + ^ 0 e j. 

i^j i= 1 

We can use Lemma [4.II to estimate the first summand, taking x l0 = A l] e l 0 e.j if i j 
and Xij — 0 if i — j. To this end, let (h') be an independent copy of p), and let Pa> 
denote the coordinate projection on the random set of coordinates {j | h( = 1}. Then 
by Lemma 14.11 and by the triangle inequality we obtain 

n 

E||PaRPa||oo^i < 20E||P A p - PP))P A '||oo^i + \M- 

i= 1 
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Clearly, Y^i=\ |Ai| = ||-D(-4)||oo->i- Thus to complete the proof, we can assume that 
the diagonal of A is zero, and prove the inequality as stated in the theorem for 
EHPaAPa'IIoo- 1 , i.e. 

E||PaALP A '||oo^i < CT 2 ||A|| 00 _ 1 + C«5 3 / 2 (P||coi + P T ||coi). (4.1) 

Note that 

n 

EHAPa'Hoo-u = E sup V \(AP A rx,ei)\, 

. =1 

hence 


E||P a APa'||oo^i = E sup 5i\(AP A 'X,ei)\ 

x£B ~ i= i 
n 

= E sup y'(<5 i -<5)|(^PA'a;,e i )| + (5-E||APA'||oo- 1 . (4-2) 

T (Z IRn Z J 
Xt -°oo i=l 

We proceed with a known symmetrization argument, which we used in the beginning 
of Section [3721 Since Si — 5 are mean zero, we can replace S by 5", an independent copy 
of Si, which can only increase the quantity in (14.2D . Then the first term in (I4.2K does 
not exceed 

n 

E sup yfoj - 5")\(AP A 'X,ei)\. (4.3) 

x£B ™ i=l 

The random variable S, — 5" is symmetric, hence it is distributed identically with 
£i{Si — 5"), where £i are —1,1-valued symmetric random variables independent of all 
other random variables. Therefore the expression in (14.3ft bounded by 


n n 

E sup y^'£jSi\(AP A 'X, 6j )| + E sup y^£iSi\(AP A ’X, ej)\ 

Z6.B" i=1 x€B^ i=1 

n 

< 2E sup y2 £ iSj\{AP A 'X,ei)\. (4.4) 


To estimate this, we use Slepian’s inequality for Rademacher random variables proved 
by Talagrand. This estimate allows us to remove the absolute values in (14.4j) . Precisely, 
a partial case of Slepian’s inequality due to Talagrand (see HU, equation (4.20)) states 
that, for arbitrary jq,. .., y n G M n , one has 


E sup 

ZS.B2. 


E 

i— 1 


£i\{x,yi)\ < E sup £i(x,yi) = E V" 

X&BT ^ 11 —' 




i =1 


i= 1 
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Therefore 


XGB? 


i= 1 


E sup £iSi\(AP A ,x, ej)| = E sup eMx, P A iA t 5^) 

^ 1_> '" f j CL Dn ' J 

00 2=1 

n 

Pa'A t ( ^ 

2=1 

n n 


< E 


3=1 i=l 

n n 

5 - E I] | J2 £iSiA 

j =i *=i 


y 


n n 


3=1 i=l 


V 


2 \ 1/2 


n n 


*-E( e I>a 

3=1 i=l 


n n 


<s 3/2 -£(£A« I 

3=1 i=l 


9 X / 2 

2 ' (averaging over (£j)) 
1/2 


= s 3/2 Mllco]. 


We have proved that the first term in (14.21) does not exceed <5 3//2 || A||c 0 i- To estimate 
the second term, note that 

n 

E||AP A /||oo^i = E||P A 24 t || 0O ^i = E sup y^8 i \{A T x,e i )\. 

^6SSc i=1 

So we can essentially repeat the argument above to bound this expression by 

< <5 1//2 11||Coi + & ||^4 T ||oo->i = ||^4 T ||coi + 5||A||oo^i. 

Putting this together, we can estimate (14.21) as 


Eiip a ^fviL-.i < <y 3/2 imiicoi + i(i 1/2 M T iic„i + iMiu-i) 

< ^Mll Col + ^ /2 p T ||col + ^P|| 00 ^i 

as desired. This completes the proof of Theorem 11.51 


4.2. Optimality. All terms appearing in Theorem 1 1.81 are necessary. Their optimality 
can be witnessed on different types of matrices. To see that the first term is necessary, 
consider a matrix A, all whose entries are equal 1. For this matrix ||A|| C = n 2 , and for 
any Q C \\A QxQ \\ c = \Q\ 2 . 

The optimality of the second term can be seen in the case when A is the identity 
matrix. In this case ||A|| C = n, while ||Aq x q|| = \Q\. 
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To prove that the third term is also necessary, assume that A = (£ij) is a random 
±1 matrix. Then ||T>(t 4)|| c , = n, and ||A|| Col = ||A T || Col = n 3 / 2 . It is easy to show that 
E £ \\A\\ C < Cn 3 / 2 , so for q < n the third term in Theorem 11.81 is dominant. Indeed, by 
Azuma’s inequality, for any x, y G {0, l} n 




n 


£ ij X iVj 


i,j =1 


> t) < Ce~ t2/2 . 


Hence, 

P e (||A|| c > sn 3/2 ) <4 n -Ce~ s2n , 
which implies the desired bound for the expectation. 

Now fix a ±1 matrix A such that || A\\ c < Cn 3 / 2 . Let Q be any subset of {1,..., n}. 
Recall that the norms ||A|| C and ||A|| oo _ >1 are equivalent. We claim that 

PIqxoIL^ > 


Indeed, let Si, i G Q be independent ±1 random variables, 
inequality 


E <5 Sij8i 

jeQ i£Q 


> 


Then by Khinchine’s 


Choose x G { — 1,1} Q such that JT g g 



> -^\Q\ 3/2 . For jeQ set 



Then 


\\A\ 


QxQ | 


.!> 


£ ij X iVj 

i,j&Q 


> 


Therefore, 

e Q \\ a \qxq\\ c > (“) ' ( Pllcoi + ll^llcoi)- 

Therefore, the third term is also necessary. 


5. The decay of the spectral norm 

In this section, we prove Theorem 11.81 on the spectral norm of random submatrices. 
By homogeneity we can assume that ||A|| 2 = 1. Let Si,...,S n be {0,1}-valued 
independent random variables with ET,- — 5 — So our random set is Q = {j \ Sj = 1}. 
Let xi... x n denote the columns of A. Then 

n n 

A = 6j ® Xj, A\q = Sj6j ® Xj. 

3 = 1 j =1 
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The spectral norm can be computed as 


2 = \\A T A\$? = | £ 

1= 1 


Xj 09 Xj 


1/2 

> 

2 


and similarly 


jjXj 09 Xj 


1/2 

2 


pioib - 1 y>i 

i=i 

To estimate the latter norm, we shall first apply the standard symmetrization argument 
(see HZj Lemma 6.3), like we did in the beginning of Section [3T2l and in Section^ Then 
we will apply Lemma 1331 Set 

E=n\A\ Q h. 

The symmetrization argument yields 


E ~ E 

i=i 


Xj &) Xj 


1/2 

2 


+ V5\\A\\l /2 < 2E*(e £ | J^ejS. 

i=i 


jXj Q9 Xj 


\ !/ 2 r- 

j + Vs. 


Now we apply Lemma [3751 with p = 1 to bound E £ Xl?=i £ j$j x j ® x j f° r fixed ( 5j ). 

_ ■ 2 

By Remark 13.61 we can assume k in this Lemma equal 

n(S) := e + Sj. 

j<n 

Then using Cauchy-Schwartz inequality we obtain 

i/ 2 \ i/ 2 
2 


E < E d - 


og n(S) ■ max Sj ||xj|| 2 


1 = 1 
1/2 


■jXj 09 Xj 


2n 1/2 

] + 


< C^E^i/logn^) • max <5j ||xj|| 2 j j ^E^jj 


jXj 09 Xj 


i=i 


\ !/ 2 

J + vft (5.1) 


l/2\ 1/2 
2 


To estimate the fist term in the product here, we use the following 


Lemma 5.1. Let a± > ci 2 > ... > a n > 0 and let Si ... S n be independent Bernoulli 
random variables taking value 1 with probability S > 2 jn. Then 


1/5 


i=i 


1/5 

— A/log Sn ■ Oj < E ( \/log n(ci) ■ max S 3 a 3 j < 4 S \/log Sn ■ a 3 . 

J i=i 


Proof. To prove the upper estimate note that 

1/5 

max SjOj < SjOj + ai/g. 


i=i 
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Hence, 


1/5 

E f v / logn(^) • max Sjdj J < E^^/logn(<5) • a j ) + ays ■ Ev/logn(5). (5.2) 

' 3 i=i 

Jensen’s inequality yields 

E A/log n(d) < 


\ 


log(E Y & + e) < 2+log Jn. 


(5.3) 


i =1 


By the linearity of expectation, the first term in the right hand side of (15.21) equals 


1/5 


3 =1 


1/5 


Y ajE ( 8j<\/log n(8)j < Y a J E U lo g(E 8i + l + e) , 


3 =1 




where we estimated n(8) replacing 8j by 1. Taking the expectation first with respect 
to 8j and then with respect to the other 8i, and using Jensen’s inequality, we bound 
the last expression by 


1/5 1/5 

8 "Y, a j ' \/log{8n + 1 + e) < 28 a,j ■ A/log 8n. 

3 =i 3 =i 


(5.4) 


Finally, substituting (j5.3[) and (j5.4[) into (15.21) . we obtain 

1/5 1/5 

E f A/log n{8) ■ max Sjdj J < (^28 aj + 2ai/5^ • A/log 8n < 4 8 aj ■ \flog 8n. 


3 =1 


3 =1 


To prove the lower bound, we estimate the product in Lemma 15.11 from below to 
make the terms independent. We have 


E ( \/logn(d) ■ max Sjdj ) > E 

j=l...n ) 




A 


log( > 8i + e) • max 

^ 7=1...1/5 


*=1/5+1 


= E. 


8jdj. 


j log( > + e) • E max 

\ ^ j'=l...l/5 

\ *=1/5+1 

These terms will be estimated separately. Since P (^ILi/ 5+1 ^ — ^ n /2) > 1/2, 


(5.5) 
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Let 1 < k <1/5. Denote by the event {5k = 1, 5j = 0 for 1 < j < 1 /5, j ^ k}. 
Then 

P (A k ) = 6 ■ (1 - S) 1 ^ 1 > 5/e. 

Since the events A 1: ..., A 1 / l 5 are disjoint, 

1/5 1/S 

E max SjCij > 

J / k =1 e j=i 

Substituting this estimate into (15.5ft hnishes the proof of Lemma 15.11 ■ 


Now we can complete the proof of Theorem 11.81 Combining Le mm a 15.11 and (15.111 , 
we get 

^ A 1/2 1/2 

E <c(45y/log5nJ2\\xjh) 'hi 1/2 + V5 = 2C'(yiogHI^II(i/5)) ~ E 1 ' 2 + V5. 

3 = 1 

It can be easily checked that E < aE 1 / 2 + 5 implies E < 4a 2 + 2b. Hence, recalling 
that 5 — q/n, we conclude that 

E < 1QC 2 \J\ogq ■ ||A|| (n/g) + 2\fqjn. 

This completes the proof of Theorem 11.81 
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